An Introduction to Logistic Regression in Python

Machine learning has revolutionized the world of business and is helping us build sophisticated applications to solve tough business problems. Using supervised and unsupervised machine learning models, you can solve problems using classification, regression, and clustering algorithms. In this article, we’ll discuss a supervised machine learning algorithm known as logistic regression in Python. Logistic regression can be used to solve both classification and regression problems. 

Your AI/ML Career is Just Around The Corner!

AI Engineer Master's ProgramExplore Program
Your AI/ML Career is Just Around The Corner!

What is Logistic Regression?

Logistic regression machine learning is a statistical method that is used for building machine learning models where the dependent variable is dichotomous: i.e. binary. Logistic regression is used to describe data and the relationship between one dependent variable and one or more independent variables. The independent variables can be nominal, ordinal, or of interval type.

The name “logistic regression” is derived from the concept of the logistic function that it uses. The logistic function is also known as the sigmoid function. The value of this logistic function lies between zero and one.

The following is an example of a logistic function we can use to find the probability of a vehicle breaking down, depending on how many years it has been since it was serviced last.

years

Here is how you can interpret the results from the graph to decide whether the vehicle will break down or not.

  • Identify Key Metrics

  • Understand Threshold Values

  • Analyze Trend Patterns

  • Compare Current Data to Baseline

  • Look for Warning Signs

  • Consider Historical Data

  • Assess Overall Condition

  • Make a Decision

Logistic Regression in Machine Learning

Logistic regression machine learning is a key classification technique. It can work with both numerical and categorical data, making it versatile for various applications. For example, it’s commonly used to predict whether a customer will leave a service (churn), identify fraudulent transactions, or determine if a patient has a specific condition.

One of the main advantages of logistic regression is its simplicity. Logistic regression machine learning not only predicts outcomes but also helps understand which factors are most important for these predictions. This makes logistic regression a practical tool for solving classification problems while providing clear insights into the data. Its ease of use and interpretability make it popular in many machine-learning projects.

Logistic Function - Sigmoid Function

The sigmoid or logistic function is essential for converting predicted values into probabilities in logistic regression. This function maps any real number to a value between 0 and 1, ensuring that predictions remain within this probability range. Its "S" shaped curve helps translate raw scores into a more interpretable format.

A threshold value is used in logistic regression to make decisions based on these probabilities. For instance, if the predicted probability is above a certain threshold, such as 0.5, the result is 1. If it’s below, it’s classified as 0. This approach allows for clear and actionable outcomes, such as determining whether a customer will purchase a product or a patient has a particular condition based on the probability calculated by the sigmoid function.

Types of Logistic Regression

Logistic regression is a versatile machine learning algorithm used for binary and multi-class classification tasks. Depending on the nature of the dependent variable, logistic regression can be categorized into different types. The main types of logistic regression include Binary Logistic Regression, Multinomial Logistic Regression, and Ordinal Logistic Regression.

Binary Logistic Regression

Binary logistic regression is the most common type of logistic regression, where the dependent variable has only two possible outcomes or classes, typically represented as 0 and 1. It is used when the target variable is binary, such as yes/no, pass/fail, or true/false. The logistic function in binary logistic regression models the probability of an observation belonging to one of the two classes.

Multinomial Logistic Regression

Multinomial logistic regression, also known as softmax regression, is used when the dependent variable has more than two unordered categories. Unlike binary logistic regression, which deals with binary outcomes, multinomial logistic regression can handle multiple classes simultaneously. It models the probability of an observation belonging to each class using the softmax function, which ensures that the predicted probabilities sum up to one across all classes.

Ordinal Logistic Regression

Ordinal logistic regression is employed when the dependent variable has more than two ordered categories. In other words, the outcome variable has a natural ordering or hierarchy among its categories. Examples include ordinal scales like low, medium, and high, or Likert scale responses ranging from strongly disagree to strongly agree. Ordinal logistic regression models the cumulative probabilities of an observation falling into or below each category using the cumulative logistic distribution function.

Assumption in a Logistic Regression Algorithm

  • In a binary logistic regression, the dependent variable must be binary
  • For a binary regression, the factor level one of the dependent variables should represent the desired outcome
  • Only meaningful variables should be included
  • The independent variables should be independent of each other. This means the model should have little or no multicollinearity
  • The independent variables are linearly related to the log odds
  • Logistic regression requires quite large sample sizes

Logistic Regression Equation

The logistic regression equation is:

y= 11+ e(b0+b1x)

where x is the input value, y is the predicted probability, b0 is the intercept, and b1​ is the coefficient for the input x. This equation models the probability of a binary outcome based on a linear combination of input features.

Properties of Logistic Regression Equation

Logistic regression comes with a few key characteristics that define how it works and how it’s assessed:

  • Bernoulli Distribution

In logistic regression, the predicted outcome is binary, meaning it follows a Bernoulli distribution. This simply means that the result can only be one of two possible values, like "yes" or "no," "success" or "failure." This fits perfectly with logistic regression’s goal of classifying data into two categories.

  • Maximum Likelihood Estimation

The maximum likelihood method is used to find the best-fit parameters for a logistic regression model. This technique identifies the parameter values that make the observed data most probable. In other words, it adjusts the model to match the data it has seen best, which helps make accurate predictions.

  • Concordance for Model Fit

Instead of using R squared like in linear regression to measure how well the model fits the data, logistic regression algorithms use concordance. Concordance assesses how well the model ranks the predicted probabilities. It checks whether the model is good at ordering outcomes correctly rather than just fitting the data. This approach is more beneficial for classification tasks where the goal is to predict which category something belongs to.

Key Terminologies of Logistic Regression

Apart from the properties of logistic regression, several key terms are crucial for understanding how the logistic regression machine learning model works:

  • Independent Variables

These are the features or factors used to predict the model's outcome. They are the inputs that help determine the value of the dependent variable. For instance, independent variables might include age, income, and past buying behavior in a model predicting whether a customer will purchase a product.

  • Dependent Variable

This is the outcome the model is trying to predict. In logistic regression, the dependent variable is binary, meaning it has two possible values, such as "yes" or "no," "spam" or "not spam." The goal is to estimate the probability of this variable being in one category versus the other.

  • Logistic Function

The logistic function is a formula that converts the model’s input into a probability score between 0 and 1. This score indicates the likelihood of the dependent variable being 1. It’s what turns the raw predictions into meaningful probabilities that can be used for classification.

  • Odds

Odds represent the ratio of the probability of an event happening to the probability of it not happening. For example, if there’s a 75% chance of an event occurring, the odds are 3 to 1. This concept helps to understand how likely an event is compared to it not happening.

  • Log-Odds

Log-odds, or the logit function, is the natural logarithm of the odds. In logistic regression, the relationship between the independent variables and the dependent variable is expressed through log-odds. This helps model how changes in the independent variables affect the likelihood of the outcome.

  • Coefficient

Coefficients are the values that show how each independent variable influences the dependent variable. They indicate the strength and direction of the relationship. For example, a positive coefficient means that as the independent variable increases, the likelihood of the dependent variable being 1 also increases.

  • Intercept

The intercept is a constant term in the model representing the dependent variable's log odds when all the independent variables are zero. It provides a baseline level of the dependent variable’s probability before considering the effects of the independent variables.

  • Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is the method used to find the best-fitting coefficients for the model. It determines the values that make the observed data most probable under the logistic regression framework, ensuring the model provides the most accurate predictions based on the given data.

How Does the Logistic Regression Algorithm Work?

Consider the following example: An organization wants to determine an employee’s salary increase based on their performance.

For this purpose, a linear regression algorithm will help them decide. Plotting a regression line by considering the employee’s performance as the independent variable, and the salary increase as the dependent variable will make their task easier.

salary-hike

Now, what if the organization wants to know whether an employee would get a promotion or not based on their performance? The above linear graph won’t be suitable in this case. As such, we clip the line at zero and one, and convert it into a sigmoid curve (S curve).

employee-rating

Based on the threshold values, the organization can decide whether an employee will get a salary increase or not.

To understand logistic regression, let’s go over the odds of success.

Odds (𝜃) = Probability of an event happening / Probability of an event not happening

𝜃 = p / 1 - p

The values of odds range from zero to ∞ and the values of probability lies between zero and one.

Consider the equation of a straight line: 

𝑦 = 𝛽0 + 𝛽1* 𝑥

equation

Here, 𝛽0 is the y-intercept

𝛽1 is the slope of the line

x is the value of the x coordinate

y is the value of the prediction

Now to predict the odds of success, we use the following formula:

log

Exponentiating both the sides, we have:

et-y

Let Y = e 𝛽0+𝛽1 * 𝑥

Then p(x) / 1 - p(x) = Y

p(x) = Y(1 - p(x))

p(x) = Y - Y(p(x))

p(x) + Y(p(x)) = Y

p(x)(1+Y) = Y

p(x) = Y / 1+Y

px

The equation of the sigmoid function is:

The sigmoid curve obtained from the above equation is as follows:

Now that you know more about logistic regression algorithms, let’s look at the difference between linear regression and logistic regression.

Advantages of the Logistic Regression Algorithm

  • Logistic regression performs better when the data is linearly separable
  • It does not require too many computational resources as it’s highly interpretable
  • There is no problem scaling the input features—It does not require tuning
  • It is easy to implement and train a model using logistic regression
  • It gives a measure of how relevant a predictor (coefficient size) is, and its direction of association (positive or negative).

Code Implementation for Logistic Regression

Let's explore how logistic regression can be implemented for different scenarios. Depending on the nature of the target variable, logistic regression can be used for both binomial and multinomial classification problems.

  • Binomial Logistic Regression

In a binomial logistic regression, the target variable has only two possible outcomes, such as "accepted" vs. "rejected," "approved" vs. "denied," or "positive" vs. "negative." A practical example is predicting whether a loan application will be approved based on various applicant features like income, credit score, and employment history.

To implement this, the necessary libraries are imported, and a dataset of loan applications is used to train the model. The data is split into training and testing sets to evaluate the model’s performance. After training the logistic regression model on the training data, it predicts the outcomes for the test data. The accuracy of the model is then calculated to measure its effectiveness. This provides a clear understanding of how the model predicts loan approvals.

  • Multinomial Logistic Regression

Multinomial logistic regression is used when the target variable can have three or more possible outcomes, and these outcomes are not in any specific order. For example, consider a situation where we are classifying diseases into three categories: "disease A," "disease B," and "disease C." Here, logistic regression can help predict the likelihood of each category.

In this example, the Digit Dataset is used to classify handwritten digits (0-9). Unlike binomial logistic regression, the data is split into training and testing sets. After training the model, predictions are made on the test data, and the model’s accuracy is evaluated. In this case, the model achieved an impressive accuracy of 96.52%.

Python Implementation of Logistic Regression with Example

Here’s how to use logistic regression in Python to predict SUV purchases based on user age and salary:

  • Data Preparation

Start by loading your dataset, which includes user details like age and salary. From this dataset, you'll identify the columns representing the features (age and salary) and the target variable (whether the user will purchase the SUV). The next step is to split this data into two parts: a training set to build the model and a test set to evaluate its performance. This separation ensures that the model is tested on new, unseen data.

  • Training the Model

With your data prepared, you can now train the logistic regression model. Logistic regression is well-suited for binary outcomes, like predicting whether a user will buy the SUV (yes or no). During training, the model learns from the data by finding patterns in age and salary that help distinguish between users who are likely to purchase the SUV and those who are not.

  • Making Predictions

Once the model is trained, it predicts outcomes on the test set. This means applying the model to new data to estimate the likelihood that each user will purchase the SUV. These predictions are then compared to actual outcomes to determine how accurate the model is.

  • Evaluating Performance

To evaluate the model’s accuracy, you’ll create a confusion matrix. This matrix shows how many predictions were correct and how many were incorrect. It compares the predicted results with the purchase decisions, helping you understand how well the model performs.

  • Visualizing Results

Finally, visualize the results to see how well the model has done. Plotting the data and predictions provides a clear picture of how the model separates users who are likely to buy the SUV from those who aren’t. This visualization makes interpreting the model’s performance and seeing its practical implications easier.

Evaluate Logistic Regression Model

Evaluating a logistic regression model involves several key metrics that assess its performance:

  • Accuracy

Accuracy measures how often the model correctly classifies instances. It gives the proportion of correct predictions (both true positives and true negatives) out of all predictions made. High accuracy means the model is generally reliable at predicting the correct class.

  • Precision

Precision focuses on the correctness of positive predictions. It tells you how many of the instances predicted as positive by the model are positive. This metric is especially important when the cost of false positives (incorrectly predicting a negative instance as positive) is high.

  • Recall

Recall, also known as sensitivity or the true positive rate, evaluates how well the model identifies actual positive instances. It shows the proportion of actual positives that the model correctly predicts. High recall indicates that the model identifies positive cases, even if it means more false positives.

  • F1 Score

The F1 Score is a combined measure of precision and recall, providing a single value that balances both metrics. It is useful when the trade-off between precision and recall must be balanced, especially in situations where both false positives and false negatives are important.

  • Area Under the Receiver Operating Characteristic Curve (AUC-ROC)

The AUC-ROC assesses the model’s ability to distinguish between positive and negative classes across various thresholds. It reflects how well the model performs overall in classifying different classes, with a higher score indicating better performance.

  • Area Under the Precision-Recall Curve (AUC-PR)

The AUC-PR measures the model’s precision and recall performance across different levels. It is handy for evaluating models on imbalanced datasets where one class is much more common than the other.

Precision-Recall Tradeoff in Logistic Regression Threshold Setting

Let’s now look at how setting the decision threshold in logistic regression affects the balance between precision and recall.

  • Low Precision/High Recall

When it's crucial to catch as many positive cases as possible, even if it means a higher number of false positives, a threshold that boosts recall is the way to go. For example, in medical tests for serious conditions like cancer, you want to identify every possible case, even if some healthy people are mistakenly diagnosed as having the disease. Using a lower threshold increases the chances of detecting all potential cases, which is vital for early intervention, despite the trade-off of some incorrect positive results.

  • High Precision/Low Recall

On the other hand, if your priority is to avoid false positives and you can accept missing some true positives, you should choose a threshold that enhances precision. For instance, in targeted advertising, you want to be sure that those identified as likely to respond positively will actually do so. A higher threshold means that only those with a strong likelihood of responding are selected, reducing the risk of spending resources on people who are unlikely to engage, even though it might lead to missing some genuine positive responses.

Linear Regression vs. Logistic Regression

Linear Regression

Logistic Regression

Used to solve regression problems

Used to solve classification problems

The response variables are continuous in nature

The response variable is categorical in nature

It helps estimate the dependent variable when there is a change in the independent variable

It helps to calculate the possibility of a particular event taking place

It is a straight line

It is an S-curve (S = Sigmoid)

Logistic Regression Best Practices

Here are the key best practices for ensuring that logistic regression models are accurate and effective:

  • Identify Dependent Variables to Ensure Model Consistency

In logistic regression, it's essential to have an inherently binary dependent variable. This means it should naturally fall into one of two categories, such as "yes/no," "disease/no disease," or "successful/unsuccessful." 

For example, in healthcare research, outcomes like "cancerous/non-cancerous" fit well with logistic regression. However, it's crucial to avoid turning continuous variables (like income) into binary categories (e.g., "rich" versus "poor") without a strong justification. Doing so can lead to a significant loss of information and reduce the model's effectiveness, as this recoding oversimplifies complex data and might obscure important nuances.

  • Discover the Technical Requirements of the Model

Logistic regression needs careful attention to several technical factors to work properly and efficiently:

  • Increase the Number of Observations

More data generally leads to more reliable and stable estimates. Larger sample sizes help reduce the impact of multicollinearity (when independent variables are highly correlated), which can skew results.

  • Use Data Reduction Techniques

Techniques like principal component analysis (PCA) can help manage multicollinearity by combining correlated variables into fewer synthetic measures. This reduces redundancy and improves model performance.

  • Monitor Sample Size

Small sample sizes can lead to inaccurate and unstable estimates. Ensuring a sufficiently large sample size is crucial for obtaining reliable results.

  • Exclude Extreme Outliers

Outliers can disproportionately affect the model's coefficients. Identifying and removing these outliers helps create a model that better represents most of the data and improves overall fit.

  • Estimate the Model and Evaluate the Goodness of Fit

Accurate model estimation involves several key steps that ensure the robustness and reliability of the logistic regression model:

  • Transparency

Report all relevant details about the model, including the software and data used. Access to the original data and computational scripts enhances replicability and allows others to verify the results.

  • Goodness-of-Fit Evaluation

Assess how well the model fits the data by comparing it to a null model, which includes only the intercept and no predictors. This comparison helps determine if the logistic regression model significantly improves predictions over a simple baseline model. A model with a better fit will generally provide more accurate predictions and insights.

  • Appropriately Interpret the Results

Interpreting logistic regression results requires understanding the coefficients expressed in terms of odds ratios rather than raw values. Here’s how to interpret them:

  • Odds Ratios

A coefficient represents how changes in an independent variable affect the odds of the dependent variable occurring. For example, a coefficient of 0.4 suggests that a one-unit increase in the independent variable corresponds to an increase of 0.4 in the log odds of the dependent variable.

  • Contextual Explanation

Unlike linear regression, where coefficients directly show the impact on the dependent variable, logistic regression coefficients must be explained in terms of odds. This means understanding and describing how each predictor influences the likelihood of the outcome occurring.

  • Validate Observed Results

Validation is crucial to ensure the model's findings are reliable and applicable. Here is how it can be effectively carried out:

  • Use a Subsample

Test the model on a subsample of the original dataset to assess its performance. This helps confirm that the model’s predictions are specific to the training data and can be generalized to other datasets.

  • External Validity

This practice assesses whether the results can be applied to other populations or settings. It ensures that the model’s findings are robust and not limited to the particular sample used for training. Validating with different data helps confirm that the model’s predictions are accurate and generalizable.

Applications of Logistic Regression

Let's explore some of the most common applications of logistic regression across different industries:

  • Optical Character Recognition (OCR)

Optical Character Recognition (OCR) is a process that converts handwritten or printed characters into digital text, making it readable by computers. Since OCR involves identifying specific characters from possible outcomes, it qualifies as a classification task in machine learning. 

Logistic regression machine learning is instrumental in this context, where it helps classify characters as present or absent in an image. Features such as lines, curves, and edges extracted from the image serve as input variables, while logistic regression estimates the likelihood of character presence. By applying the model to new images, accurate character recognition becomes possible.

  • Fraud Detection

Fraud detection identifies and prevents deceptive activities, particularly in finance, insurance, and e-commerce. Logistic regression is a powerful tool for detecting fraudulent transactions by classifying them as either legitimate or fraudulent. The model uses independent variables such as transaction value, location, time, and user information to predict the likelihood of fraud. Organizations can significantly improve their ability to spot and stop fraudulent activities by combining logistic regression with other methods like anomaly detection.

  • Disease Spread Prediction

Predicting the spread of diseases can also be approached as a classification problem, where the goal is to determine whether an individual is likely to contract a disease. Logistic regression helps model the relationship between population demographics, health conditions, environmental factors, medical resource availability, and the probability of disease transmission. Using historical data, logistic regression can predict disease spread patterns, helping public health officials respond more effectively. Combining logistic regression with time series analysis and clustering techniques can further enhance the accuracy of predictions.

  • Illness Mortality Prediction

In healthcare, logistic regression is often used to predict mortality in patients suffering from specific illnesses. The model is trained using data on patient demographics, health status, and clinical indicators such as age, gender, and vital signs. By analyzing these variables, logistic regression can estimate the probability of a patient dying from the illness. This enables medical professionals to make informed decisions about patient care, improving outcomes and resource allocation.

  • Churn Prediction

Churn prediction identifies customers likely to stop using a product or service. Logistic regression models customer churn by analyzing demographic information, usage patterns, and behavior. The model assigns a probability to each customer’s likelihood of churning, enabling businesses to take proactive measures to retain them. Interventions such as targeted marketing campaigns, personalized offers, and enhanced customer support can be deployed based on these predictions, helping reduce churn rates and improve customer loyalty.

Your AI/ML Career is Just Around The Corner!

AI Engineer Master's ProgramExplore Program
Your AI/ML Career is Just Around The Corner!

Use Case: Predict the Digits in Images Using a Logistic Regression Classifier in Python

We’ll be using the digits dataset in the scikit learn library to predict digit values from images using the logistic regression model in Python.



  • Importing libraries and their associated methods

/determining

  • Determining the total number of images and labels

displaying.

  • Displaying some of the images and their labels

  • Dividing dataset into “training” and “test” set 

importing-the

  • Importing the logistic regression model

making

  • Making an instance of the model and training it

predicting

  • Predicting the output of the first element of the test set

  • Predicting the output of the first 10 elements of the test set

prediction.

  • Prediction for the entire dataset

  • Determining the accuracy of the model

  • Representing the confusion matrix in a heat map

actual-output

  • Presenting predictions and actual output

above-depicts

The images above depict the actual numbers and the predicted digit values from our logistic regression model.

Your AI/ML Career is Just Around The Corner!

AI Engineer Master's ProgramExplore Program
Your AI/ML Career is Just Around The Corner!

Want to Learn More?

If you want to kick off a career in this exciting field, check out Simplilearn’s Artificial Intelligence Engineer Master’s Program, offered in collaboration with IBM. The program enables you to dive much deeper into the concepts and technologies used in AI, machine learning, and deep learning. You will also get to work on an awesome Capstone Project and earn a certificate in all disciplines in this exciting and lucrative field.

Conclusion

We hope that this article has helped you get acquainted with the basics of supervised learning and logistic regression. We covered the logistic regression algorithm and went into detail with an elaborate example. Then, we looked at the different applications of logistic regression, followed by the list of assumptions you should make to create a logistic regression model. Finally, we built a model using the logistic regression algorithm to predict the digits in images.

If you're eager to deepen your understanding and proficiency in Python for data science and machine learning, consider enrolling in a comprehensive Python training course. Such a course can provide you with the foundational knowledge and practical skills needed to excel in the field. From mastering Python syntax to advanced data manipulation techniques and machine learning algorithms, a well-structured training program can be invaluable on your learning journey. Look for courses that offer hands-on projects and real-world applications to reinforce your learning.

FAQs

1. What is classification?

Classification is a machine learning task where the goal is to categorize input data into predefined classes or categories based on certain features or attributes. It is a type of supervised learning, where the algorithm learns from labeled training data and then predicts the class labels of unseen data. The output of a classification model is a discrete class label or a probability distribution over the classes.

2. When Do You Need Classification?

Classification is needed in various real-world scenarios where the goal is to make predictions or decisions based on input data. Some common applications of classification include:

  1. Spam Detection: Classifying emails as spam or non-spam based on their content and characteristics.
  2. Disease Diagnosis: Predicting whether a patient has a certain medical condition based on their symptoms and test results.
  3. Sentiment Analysis: Categorizing text documents (e.g., reviews, social media posts) as positive, negative, or neutral based on their sentiment.
  4. Image Recognition: Identifying objects or patterns in images and assigning them to predefined categories.
  5. Fraud Detection: Detecting fraudulent transactions or activities in financial transactions or online platforms.

3. How to Import Logistic Regression in Python?

To import logistic regression in Python, you can use the scikit-learn library, which provides a comprehensive set of machine learning algorithms and tools. Here's how you can import logistic regression from scikit-learn:

from sklearn.linear_model import LogisticRegression

After importing the logistic regression module, you can create an instance of the logistic regression model and train it using your data. Here's a basic example:

# Import logistic regression
from sklearn.linear_model import LogisticRegression

# Create logistic regression model
model = LogisticRegression()

You can then use this model to fit the training data and make predictions on new data. Remember to import other necessary modules such as numpy and pandas for data manipulation and preprocessing.

4. What role does the logistic function play in Logistic Regression?

The logistic function, or sigmoid function, converts the output of the logistic regression model into a probability between 0 and 1. This probability helps in making binary classifications, such as deciding if a user will buy an SUV or not.

5. Why is logistic regression used for classification problems?

Logistic regression is used for classification because it predicts probabilities of outcomes that fall into distinct categories. It’s ideal for determining if an event will occur or not, such as predicting whether a customer will purchase a product based on their characteristics.

About the Author

Mayank BanoulaMayank Banoula

Mayank is a Research Analyst at Simplilearn. He is proficient in Machine learning and Artificial intelligence with python.

View More
  • Disclaimer
  • PMP, PMI, PMBOK, CAPM, PgMP, PfMP, ACP, PBA, RMP, SP, and OPM3 are registered marks of the Project Management Institute, Inc.